Task 2: Exploratory Data Analysis on Titanic Dataset¶
📦 Import libraries¶
In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Load & Preview dataset¶
In [8]:
df = pd.read_csv(r"C:\Users\Maged\Desktop\TASK 2\titanic\train.csv")
df.head()
Out[8]:
🧹 Data Cleaning¶
1. Check for missing values¶
In [45]:
df.isnull().sum()
Out[45]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
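To put those counts in proportion, a small sketch can convert them into percentages of the 891 rows in the training set (the total printed in the summary section of this notebook). The counts are copied by hand from the output above rather than recomputed from the CSV, and `missing_counts`, `n_rows` and `missing_pct` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Missing-value counts copied from the df.isnull().sum() output
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891  # total passengers in train.csv, per the summary cell

# Share of rows with a missing value, as a percentage
missing_pct = (missing_counts / n_rows * 100).round(2)
print(missing_pct)
```

On the live DataFrame the same figures should come from `df.isnull().mean() * 100`. Roughly 77% of Cabin is missing against about 20% of Age and well under 1% of Embarked, which is consistent with dropping Cabin and imputing the other two columns.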
2. Fill missing Age with median, Embarked with mode, drop Cabin¶
In [47]:
# Assign the filled columns back rather than calling fillna(inplace=True) on a
# column selection; pandas deprecates that pattern as chained assignment.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df.drop(columns=['Cabin'], inplace=True)
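The same cleaning step can also be written as a single `fillna` call that takes a column-to-value mapping, which avoids modifying a column selection in place. The sketch below tries this on a small made-up frame; `toy` and `fill_values` are hypothetical names and the rows are invented for illustration, not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the three columns with gaps
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

# Column -> fill value; median() and mode() both skip missing entries
fill_values = {
    "Age": toy["Age"].median(),             # median of 22, 38, 26
    "Embarked": toy["Embarked"].mode()[0],  # most frequent port
}

# Fill the two columns in one call, then drop the sparse column
toy = toy.fillna(fill_values).drop(columns=["Cabin"])

print(toy)
print(toy.isnull().sum().sum())  # total missing values remaining
```

Because `fillna` returns a new DataFrame here, nothing depends on whether a column selection is a view or a copy.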
3. Confirm changes¶
In [49]:
df.isnull().sum()
Out[49]:
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
🧾 Summary Stats & Group Insights¶
1. Basic statistics¶
In [32]:
df.describe()
Out[32]:
2. Survival rate by gender¶
In [17]:
df.groupby('Sex')['Survived'].mean()
Out[17]:
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
3. Survival by class¶
In [23]:
df.groupby('Pclass')['Survived'].mean()
Out[23]:
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
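The gender and class groupings can also be combined into one table, with a count alongside each rate so that small groups are easy to spot. The following is only a sketch on invented rows; `toy` and `rates` are hypothetical names and the numbers are not Titanic results.

```python
import pandas as pd

# Invented passengers, using the same column names as the dataset
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Pclass":   [1, 3, 1, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Survival rate (mean of the 0/1 column) and group size per sex and class
rates = toy.groupby(["Sex", "Pclass"])["Survived"].agg(["mean", "count"])
print(rates)
```

Running the same `groupby` on `df` instead of `toy` would give the rate for each sex within each class, and the `count` column shows how many passengers each rate rests on.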
4. Total Passengers, Total Survivors & Survival Rate¶
In [78]:
total_passengers = df.shape[0]
survivors = df['Survived'].sum()
survival_rate = round((survivors / total_passengers) * 100, 2)
print(f" Total Passengers: {total_passengers}")
print(f" Total Survivors: {survivors}")
print(f" Survival Rate: {survival_rate}%")
Total Passengers: 891
Total Survivors: 342
Survival Rate: 38.38%
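Since Survived is coded as 0 or 1, the mean of the column is the survival rate, so the three-step calculation can be cross-checked in one line. The sketch below rebuilds a stand-in column from the totals printed above instead of loading the CSV; `survived`, `rate_from_counts` and `rate_from_mean` are illustrative names.

```python
import pandas as pd

# Stand-in 0/1 column with the printed totals: 342 survivors out of 891
survived = pd.Series([1] * 342 + [0] * (891 - 342))

# Rate from explicit counts, as in the cell above
rate_from_counts = round(survived.sum() / len(survived) * 100, 2)

# Rate from the column mean
rate_from_mean = round(survived.mean() * 100, 2)

print(rate_from_counts, rate_from_mean)
```

Both expressions round to 38.38, matching the printed survival rate.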
📊 Visualizations¶
In [57]:
# Set plot style
sns.set(style="darkgrid")
1. Count of survivors by sex¶
In [59]:
sns.countplot(x='Sex', hue='Survived', data=df)
plt.title('Survival Count by Gender')
plt.show()
2. Survival by passenger class¶
In [63]:
sns.countplot(x='Pclass', hue='Survived', data=df)
plt.title('Survival Count by Passenger Class')
plt.show()
3. Age distribution by survival¶
In [66]:
plt.figure(figsize=(10, 6))
# fill=True replaces the deprecated shade=True argument
sns.kdeplot(df[df['Survived'] == 1]['Age'], label='Survived', fill=True)
sns.kdeplot(df[df['Survived'] == 0]['Age'], label='Did Not Survive', fill=True)
plt.title('Age Distribution by Survival')
plt.legend()
plt.show()
4. Heatmap of correlations¶
In [71]:
# Select only numeric columns for correlation
numeric_df = df.select_dtypes(include='number')
# Now generate the correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(numeric_df.corr(), annot=True, cmap='Blues')
plt.title('Correlation Heatmap of Numeric Features')
plt.show()
5. Age histogram by survival¶
In [89]:
sns.histplot(data=df, x='Age', hue='Survived', kde=True, multiple='stack')
plt.title('Age Distribution by Survival')
plt.show()